VAST Challenge 2021 MC2 (II)

R Visualization Interactive Charts

To visualize & analyze card ownership, potential relationships and suspicious activities with the employee disappearance incident

LIU Yangguang https://www.linkedin.com/in/ygliu/ (School of Computing and Information Systems, Singapore Management)
07-31-2021

Continuing the study from Part 1.

Visualization and Insights

Q3: Infer card owners

Can you infer the owners of each credit card and loyalty card? What is your evidence? Where are there uncertainties in your method? Where are there uncertainties in the data?

If one consumption time fall with one car stop period, we believe it’s possible one correspondence. There might be several different purchase in one car stop period, but we can infer that the most common pair within one group is the most likely true pair.

The credit card data contain specific time, but loyalty card data doesn’t. Thus, We will find the relationship between credit cards and loyalty cards. After that, we will match the credit card and car stop. The owners of loyalty card can be inferred from the relationship between credit card and car stop and the relationship between credit cards and loyalty cards.

Credit and loyalty cards pairs

We full join the two card table by matching day, location and price. It’s seldom that two different consumption will have the same values in the three features. Then, we use group_by() to find all pairs of two cards and count the consumption frequency of the pair.

There will be some rows which can’t match. This might be someone used only one of the two cards or got cashback. We filter out these situations, where the card pair contains null value.

# # make a full join
card_correspond_count <- full_join(cc, loyalty,
                             by = c("day", "location", "price")) %>%
  # calculate frequency
  group_by(last4ccnum, loyaltynum) %>%
  summarise(count = n()) %>%
  # filter out mismatch
  drop_na()

# convert 'last4ccnum' into string to plot 
card_correspond_count$last4ccnum <- as.character(card_correspond_count$last4ccnum)

Most pairs are one-on-one. It’s confident to conclude there pairs are true (credit and loyalty card in each pair belong to one owner).

card_correspond_count_one2one <- card_correspond_count %>% 
  filter((n_distinct(last4ccnum)==1 & n_distinct(loyaltynum)==1))

knitr::kable(card_correspond_count_one2one,
             caption = "One-on-one matched pairs") %>% 
  kableExtra::kable_paper("hover", full_width = F) %>% 
  kableExtra::scroll_box(height = "300px")
Table 1: One-on-one matched pairs
last4ccnum loyaltynum count
1310 L8012 21
1321 L4149 22
1415 L7783 24
1874 L4424 25
1877 L3014 18
2142 L9637 25
2276 L3317 10
2418 L9018 20
2463 L6886 23
2540 L5947 20
2681 L1107 20
3484 L2490 24
3492 L7814 22
3506 L7761 6
3547 L9362 14
3853 L1485 22
4434 L2169 26
4530 L8477 10
5010 L2459 5
5407 L4034 20
6691 L6267 20
6816 L8148 20
6895 L3366 21
6899 L6267 23
6901 L9363 28
7108 L6544 16
7117 L6417 28
7253 L1682 24
7354 L9254 21
7384 L3800 26
7688 L4164 22
7792 L5756 20
7819 L5259 27
8129 L8328 21
8156 L5224 22
8202 L2343 25
8411 L6110 27
8642 L2769 12
9152 L5485 2
9220 L4063 9
9241 L3288 13
9405 L3259 21
9551 L5777 12
9614 L5924 2
9617 L5553 26
9635 L3191 24
9683 L7291 18
9735 L9633 16

For those who have matched but not one-on-one, we can plot their parallel graph.

Q3-Fig1 Code
# filter multiple match
card_correspond_count_others <- card_correspond_count %>%
  filter(n_distinct(last4ccnum)>1 | n_distinct(loyaltynum)>1)

card_correspond_count_others_plot <- card_correspond_count_others  %>%
  gather_set_data(1:2) %>%        # <- ggforce helper function
  arrange(x,last4ccnum,desc(loyaltynum))
# plot
ggplot(card_correspond_count_others_plot, 
       aes(x = x, id = id, split = y, value = count)) +
  geom_parallel_sets(aes(fill = last4ccnum), alpha = 0.7, 
                     axis.width = 0.2, n=100, strength = 0.5) +
  geom_parallel_sets_axes(axis.width = 0.25, fill = "gray95",
                          color = "gray80", size = 0.15) +
  geom_parallel_sets_labels(colour = 'gray35', size = 4.5, 
                            angle = 0, fontface="bold") +
  theme_minimal() +
  theme(
    legend.position = "none",
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    axis.text.y = element_blank(),
    axis.text.x = element_text(size = 12, face = "bold"),
    axis.title.x  = element_blank()
    )
Multiple matched pairs

Figure 1: Multiple matched pairs

We can infer that the pair with a wider line is the true pair. For example, credit card 4795 (Blue line) matches both L2070 and L8566 loyalty cards, but we can make sure that L8566 is the true pair. Because the the consumption count by L8566, which is displayed as the line width, is much higher than L2070.

knitr::kable(card_correspond_count_others,
             caption = "Other matched pairs") %>% 
  kableExtra::kable_paper("hover", full_width = F) %>% 
  kableExtra::scroll_box(height = "300px")
Table 2: Other matched pairs
last4ccnum loyaltynum count
1286 L3288 15
1286 L3572 13
4795 L2070 1
4795 L8566 25
4948 L3295 1
4948 L9406 22
5368 L2247 24
5368 L6119 1
5921 L3295 12
5921 L9406 1
7889 L2247 1
7889 L6119 20
8332 L2070 27
8332 L8566 1

These matched pairs with 1 count might contain some suspicious activities. And we can assign the rows with over 5 count to be true pairs.

In the final predicted card pairs, only credit cards 1286 correspond to multiple loyalty cards (L3288, L3572), which can be found in Figure 1

card_correspond_count_others_ture <- card_correspond_count_others %>% 
  filter(count > 5)
# union the two true pairs table
card_correspond_true <- bind_rows(card_correspond_count_one2one, 
                                  card_correspond_count_others_ture)
Match GPS and credit card data

To match credit card consumption and GPS data, we can assume that one car stop corresponds to one consumption if the consumption time falls within the car stop period at the same location.

But before that, we have to label GPS of car stops with specific locations.

All car stop locations are plotted on the map. And we can see there are many locations where the car stopped for over 6 hours (red dot on the map). Most of them are near the five parks (along the coast).

We are interested in car stops where credit card consumption happened, so we should exclude these stops which are very likely at home.

Besides, we notice that there are some long car stop near other locations. Those blue dots near “Ouzeri Elian” all belong to Isande, car 28. And he/she drives car very regular: stops at about 8:00 and leave at about 17:00.

It’s the same for car 9, Gustav. The car has many long stops near “Bean There Done That” (north-west area): stopped at about 17:00 and start moving on the second day at about 8:00. It seems that he lives here.

Q3-Fig2 Code
gps2_stop_long <- gps2_stop_sf %>%
  filter(diff_mins >= 60*6)

gps2_stop_short <- gps2_stop_sf %>%
  filter(diff_mins < 60*6)

map4 <- tm_shape(bgmap) +
  tm_rgb(bgmap, r = 1,g = 2,b = 3,
         alpha = NA,
         saturation = 1,
         interpolate = TRUE,
         max.value = 255) +
  tm_shape(gps2_stop_short) +
  tm_dots(size = 0.1, alpha = 0.5) +
  tm_shape(gps2_stop_long) +
  tm_dots(col = "blue", size = 0.2, alpha = 0.3)
tmap_leaflet(map4)

Figure 2: Long and short car stop

After excluding the long on short car stop, We can find that these car stops are still messy in Figure 2. It’s difficult to distinctly group car stops and label with locations.

Many car stops locations might not correspond to any local business locations. And some car stops, which actually correspond consumption in one location, don’t have close GPS location. Furthermore, some locations are very near each other.

To fix this issue, we can find the most confident correspondence firstly. For example, the multiple dots near the “Abila Airport”(southwest) should be a clear/distinct group. They are not single stops, close within one group and far from other dots&locations. Thus, We can believe that they correspond to the airport location with high confidence.

It’s the same for “Maximum Iron and Steel”(west), “Abila Scrapyard”(northwest), “Frank’s Fuel”(west), “Bean There Done That”(northwest), “Coffee Cameleon”(southeast), “Chostus Hotel”(northeast).

Some dots might not be a distinct group and there are several dot groups near one location. But some groups still can be labeled to one location with confidence, such as some dots along the street of the “Kronos Mart”(west). “Roberts and Sons”(west), “Desafio Golf Course”(northeast), “Albert’s Fine Clothing”(north), “Jack’s Magical Beans”(northeast), “Hallowed Grounds”(east) also have such dots groups.

After labeling these dots groups, we can match the credit card by the timestamp and location labels. It give us possible pairs of car id and credit card. For each pair, we will check whether every record has a unique corresponding a car stop record with this pair among 14 days. If all matches, this pairs will be regard as a confident pair.

# add distinct index to select GPS records later
gps2_stop_short$idx <- c(1:2659)
# calculate the distance between any two dot
distance_matrix <- st_distance(gps2_stop_short$geometry, gps2_stop_short$geometry)
# add a new col for labeling locations
gps2_stop_short$location <- ""

Take the dots group near “Frank’s Fuel”(west) as example:

We can hover in the map above to find one of dots close to the location and get the distinct ‘idx’ of this car stop.

Then we need to filter out all dots which belong to this group. To do this, we find all dots whose distances with it are less than 50 meters. And plot them on the map to check whether there are dots missing or the range of 50 meters is so large that it includes other others.

## "Frank's Fuel"(west), 
# start with dot 2545, which is near this location
# find all dots which have less than 50 meter in distance
frank_idx <- which(as.integer(distance_matrix[2545,]) < 50)
dot_group <- gps2_stop_short %>%
  filter(idx %in% frank_idx)
others <- gps2_stop_short %>%
  filter(!idx %in% frank_idx)
# check whether there are other dots near this group but with different labels (blue dots) on the map
# if yes, we need to change the dot distance from default value(50) to a bigger value
mapx <- tm_shape(bgmap) +
  tm_rgb(bgmap, r = 1,g = 2,b = 3,
         alpha = NA,
         saturation = 1,
         interpolate = TRUE,
         max.value = 255) +
  tm_shape(dot_group) +
  tm_dots(size = 0.1, alpha = 0.4) +
  tm_shape(others) +
  tm_dots(col = "blue", size = 0.2, alpha = 0.5)
tmap_leaflet(mapx)

After filtering this dot group, we will label them with location “Frank’s Fuel” and match with consumption records in credit cards by location and timestamp(the consumption time need to be within the start time and the end time of the car stop). It will give us possible pairs of one car and one credit card.

# label them with this location
gps2_stop_short$location[gps2_stop_short$idx %in% frank_idx] <- "Frank's Fuel"
# match records in cc
gps2_stop_short %>% 
  filter(idx %in% frank_idx) %>% 
  left_join(cc, by = c('location')) %>% 
  filter(timestamp > start & timestamp < end) %>% 
  select(id, start, end, timestamp, last4ccnum)
Simple feature collection with 2 features and 5 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 24.84133 ymin: 36.07212 xmax: 24.84135 ymax: 36.07213
Geodetic CRS:  WGS 84
# A tibble: 2 x 6
     id start               end                 timestamp          
  <dbl> <dttm>              <dttm>              <dttm>             
1    15 2014-01-08 11:45:01 2014-01-08 12:33:01 2014-01-08 12:29:00
2     3 2014-01-18 18:07:01 2014-01-18 18:40:01 2014-01-18 18:39:00
# ... with 2 more variables: last4ccnum <dbl>, geometry <POINT [°]>
gps2_stop_short %>% 
  filter(idx %in% frank_idx) %>% 
  left_join(cc, by = c('location')) %>% 
  filter(timestamp > start & timestamp < end)%>% 
  select(id, start, end, timestamp, last4ccnum)
Simple feature collection with 2 features and 5 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 24.84133 ymin: 36.07212 xmax: 24.84135 ymax: 36.07213
Geodetic CRS:  WGS 84
# A tibble: 2 x 6
     id start               end                 timestamp          
  <dbl> <dttm>              <dttm>              <dttm>             
1    15 2014-01-08 11:45:01 2014-01-08 12:33:01 2014-01-08 12:29:00
2     3 2014-01-18 18:07:01 2014-01-18 18:40:01 2014-01-18 18:39:00
# ... with 2 more variables: last4ccnum <dbl>, geometry <POINT [°]>

we find 2 possible pairs. Let’s check them separately.

For car id 15 and cc 3853:

# match all records of car id 15, cc 3853
match_cc <- subset(cc,last4ccnum == 3853) %>% 
  left_join(subset(gps2_stop_short, id == 15, select = c(start,end,idx, day)),
            by = c('day')) %>% 
  filter(timestamp > start & timestamp < end)

Check whether there are cc records which match multiple pairs.

match_cc %>% 
  group_by(idx) %>% 
  summarize(count = n()) %>% 
  filter(count >1)
# A tibble: 0 x 2
# ... with 2 variables: idx <int>, count <int>

Check whether there are any cc records which are not matched.

subset(cc,last4ccnum == 3853) %>%
  setdiff(match_cc[1:7])
# A tibble: 0 x 7
# ... with 7 variables: timestamp <dttm>, location <chr>,
#   price <dbl>, last4ccnum <dbl>, date <date>, day <int>, hour <int>
# All match, so label them in GPS with respective locations
gps2_stop_short$location[gps2_stop_short$idx %in% match_cc$idx] <- match_cc$location

All records match, we think the credit card 3853 belongs to the owner of the car 15.

For car id 3, cc 9635, the steps are the same.

# match all records of car id 3, cc 9635
match_cc <- subset(cc,last4ccnum == 9635) %>% 
  left_join(subset(gps2_stop_short, id == 3, select = c(start,end,idx, day)),
            by = c('day')) %>% 
  filter(timestamp > start & timestamp < end)
# check whether there are cc records which match multiple car stops
match_cc %>% 
  group_by(idx) %>% 
  summarize(count = n()) %>% 
  filter(count >1)
# A tibble: 0 x 2
# ... with 2 variables: idx <int>, count <int>
# check whether there are any cc records which are not matched
subset(cc,last4ccnum == 9635) %>% 
  setdiff(match_cc[1:7])
# A tibble: 2 x 7
  timestamp           location price last4ccnum date         day  hour
  <dttm>              <chr>    <dbl>      <dbl> <date>     <int> <int>
1 2014-01-14 12:00:00 Bean Th~ 19.4        9635 2014-01-14    14    12
2 2014-01-15 12:00:00 Bean Th~  5.34       9635 2014-01-15    15    12
# 2 out of 26 not match, but their locations are "Bean There Done That", it's still confident pair
# label them in GPS with respective locations
gps2_stop_short$location[gps2_stop_short$idx %in% match_cc$idx] <- match_cc$location

The credit card 9635 belongs to the owner of the car 3.

Other dot groups will go through the same process to find pairs of cars and credits.

Mannual work
### confident pair

### cars&consumption with pairs

### match new pairs

### 

Q4: Potential Relationships

Given the data sources provided, identify potential informal or unofficial relationships among GASTech personnel. Provide evidence for these relationships

Q4-Fig1
1
[1] 1
Q4-Fig2
1
[1] 1
Q4-Fig3
1
[1] 1

Q5: Suspicious Activities

Do you see evidence of suspicious activity? Identify 1- 10 locations where you believe the suspicious activity is occurring, and why.

Q5-Fig1
1
[1] 1
Q5-Fig2
1
[1] 1
Q5-Fig3
1
[1] 1

trucks which were used for non-business issue

day*hour Now, let’s divide the units from days into hours:

# cc_freq_day_hour <- as.data.frame(xtabs(~location++day+hour, data = cc))
# cc_freq_day_hour$hour <- as.numeric(levels(cc_freq_day_hour$hour))[cc_freq_day_hour$hour]
# p3 <- ggplot(cc_freq_day_hour,aes(x=hour,y=location))+
#   geom_tile(aes(fill=Freq),color="white")+
#   scale_fill_gradient(low = "#EFF7FB", high = "#0D2330")+
#   theme(panel.background = element_blank(),
#         axis.ticks = element_blank(),
#         axis.title = element_blank(),
#         legend.title=element_blank(),
#         plot.title = element_text(hjust=0.5))+
#   facet_wrap(~ day, ncol = 7)+
#   labs(title = "CC Frequency by hour of the day") 
# ggplotly(p3)

Conclusion

# knitr::kable(card_correspong_count,
#              caption = ) %>%
#   kableExtra::kable_paper("hover", full_width = F)
1
[1] 1